Analysis of White Wine Quality by Hyoung-Gyu Lee

Data Overview

This report explores a dataset containing 11 chemical aspects of 4898 white wines and their quality measured by experts. There are no missing values.

## [1] 4898   13
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

Univariate Plots Section

The maximum value for quality is 9 and the minimum is 3. However, the vast majority of wines fall in the 5 to 7 range. The distribution looks close to normal.

However, these are ratings measured by experts. In other words, they are ordinal values; a wine with a rating of 9 is NOT three times better than a wine that received a score of 3. To better analyze the data set, I decided to add a new variable that classifies a wine into one of three categories: “Low”, “Average”, and “Excellent”. A wine with a score below six (5 or less) is categorized as a low quality wine, a wine with a score of 6 is considered average, and wines with scores of 7, 8, or 9 are considered excellent.

## 
##       Low   Average Excellent 
##      1640      2198      1060
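
The grouping described above can be sketched with base R's `cut()`. This is only a minimal sketch and not necessarily the code used for this report: the helper name `quality_level` and the small stand-in data frame `wine` are assumptions made for illustration.

```r
# Hypothetical helper: map an integer quality score to one of three levels.
quality_level <- function(quality) {
  cut(quality,
      breaks = c(-Inf, 5, 6, Inf),   # intervals (-Inf,5], (5,6], (6,Inf)
      labels = c("Low", "Average", "Excellent"),
      ordered_result = TRUE)
}

# Small stand-in data frame (made-up scores, not the wine data).
wine <- data.frame(quality = c(3L, 5L, 6L, 7L, 9L))
wine$Level <- quality_level(wine$quality)
table(wine$Level)
```

Using an ordered factor keeps the natural Low < Average < Excellent ordering in tables and plots.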

Now, here are some graphic displays of distributions of chemical characteristics.

For the variables fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, and total sulfur dioxide, a few outliers visibly distorted the general shapes of the distributions. Therefore, new graphs were drawn excluding values outside the 1st to 99th percentile range.
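
One way to restrict a variable to its 1st to 99th percentile range is sketched below with base R's `quantile()`. The helper name `trim_percentiles` is hypothetical, and the vector `x` is stand-in data rather than a column from the wine data set.

```r
# Hypothetical helper: keep only values between the lower and upper percentiles.
trim_percentiles <- function(x, lower = 0.01, upper = 0.99) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  x[!is.na(x) & x >= bounds[1] & x <= bounds[2]]
}

# Stand-in data: mostly small values plus one extreme outlier.
x <- c(1:100, 1000)
trimmed <- trim_percentiles(x)
range(trimmed)
```

The trimmed vector can then be passed to a histogram function; the original data frame is left unchanged, so the outliers remain available for later analysis.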

The distributions for volatile acidity, free sulfur dioxide, total sulfur dioxide, pH, sulphates and fixed acidity seemed close to normal when outliers were excluded.

However, unusually many wines shared one particular level of citric acid: 215 wines had 0.49 g/dm^3 of citric acid. This is very peculiar.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The distribution for residual sugar is also interesting. It is strongly skewed to the right. However, when a log transform was applied, the distribution turned out to be bimodal. Only one wine had more than 45 g/dm^3 of residual sugar; therefore, only one wine could be considered “sweet” according to the description in the data set.
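
The effect of a log transform on a right-skewed variable can be illustrated with base R's `hist()`. This is a sketch on simulated data: the mixture below is an assumption chosen to mimic a skewed, two-cluster variable and is NOT the residual sugar column itself.

```r
set.seed(1)  # reproducible stand-in data, not the wine data

# Assumed mixture: many low values plus a second cluster of higher values.
sugar <- c(rlnorm(300, meanlog = log(1.5), sdlog = 0.3),
           rlnorm(200, meanlog = log(10),  sdlog = 0.4))

# plot = FALSE returns the bin counts without opening a graphics device.
h_raw <- hist(sugar, breaks = 30, plot = FALSE)
h_log <- hist(log10(sugar), breaks = 30, plot = FALSE)

# Call plot(h_raw) and plot(h_log) to draw the two histograms side by side.
```

On the raw scale the high values are stretched into a long tail, while on the log10 scale the two clusters become comparable in width and easier to tell apart.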

The amount of chlorides in a wine follows a distribution that closely resembles a normal curve. However, it is notable that a long tail of wines is spread fairly evenly beyond the 0.08 level.

Alcohol levels in wines range from 8 to about 14. The distribution looks slightly skewed to the right.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

The density of the wines is very close to 1. The values span only 0.9871 to 1.0390, a range of about 0.052. Nonetheless, there is some variation due to the chemical compounds in the wines.

Univariate Analysis

What is the structure of your dataset?

This dataset is a data frame with 4898 rows and 13 columns. Therefore, it has data on 4898 white wines and 13 characteristics associated with them. Although there are 13 variables in the data set, the variable X is just a row number for each wine, so there are essentially 12 variables. Of these, 11 variables such as pH and alcohol are independent variables, and 1 variable, quality, is the dependent (output) variable.

1 - fixed acidity (tartaric acid - g / dm^3)

2 - volatile acidity (acetic acid - g / dm^3)

3 - citric acid (g / dm^3)

4 - residual sugar (g / dm^3)

5 - chlorides (sodium chloride - g / dm^3)

6 - free sulfur dioxide (mg / dm^3)

7 - total sulfur dioxide (mg / dm^3)

8 - density (g / cm^3)

9 - pH

10 - sulphates (potassium sulphate - g / dm^3)

11 - alcohol (% by volume)

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

What is/are the main feature(s) of interest in your dataset?

At first, I wanted to find the variable or variables most closely related to the quality of wine. However, as I will explain throughout this report, that attempt was futile. In any case, there were some variables that I suspected would be highly correlated with quality. I will focus on residual sugar, citric acid, and chlorides, as they are chemical components that determine important aspects of taste: the sweetness, flavor, and saltiness of wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Intuitively, density and alcohol do not seem to be variables that affect the quality of wine directly. Rather, they are resulting variables that are affected by other variables such as residual sugar and sulphates. It will be interesting to see how the independent variables are actually correlated.

Did you create any new variables from existing variables in the dataset?

I created a categorical variable called “Level.” As mentioned, the quality variable is ordinal, so a categorical variable may help visualize which factors affect quality. Therefore, I created three categories, “Low”, “Average”, and “Excellent”, and assigned each wine to one of them.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

For many distributions, a few outliers distorted the shape. Therefore, I chose to graph only the data within the 1st to 99th percentile range. Secondly, I log-transformed the residual sugar data. The distribution for residual sugar was highly skewed to the right, and log-transforming it revealed that the distribution was bimodal.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                 1.00            -0.02        0.29
## volatile.acidity             -0.02             1.00       -0.15
## citric.acid                   0.29            -0.15        1.00
## residual.sugar                0.09             0.06        0.09
## chlorides                     0.02             0.07        0.11
## free.sulfur.dioxide          -0.05            -0.10        0.09
## total.sulfur.dioxide          0.09             0.09        0.12
## density                       0.27             0.03        0.15
## pH                           -0.43            -0.03       -0.16
## sulphates                    -0.02            -0.04        0.06
## alcohol                      -0.12             0.07       -0.08
## quality                      -0.11            -0.19       -0.01
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                  0.09      0.02               -0.05
## volatile.acidity               0.06      0.07               -0.10
## citric.acid                    0.09      0.11                0.09
## residual.sugar                 1.00      0.09                0.30
## chlorides                      0.09      1.00                0.10
## free.sulfur.dioxide            0.30      0.10                1.00
## total.sulfur.dioxide           0.40      0.20                0.62
## density                        0.84      0.26                0.29
## pH                            -0.19     -0.09                0.00
## sulphates                     -0.03      0.02                0.06
## alcohol                       -0.45     -0.36               -0.25
## quality                       -0.10     -0.21                0.01
##                      total.sulfur.dioxide density    pH sulphates alcohol
## fixed.acidity                        0.09    0.27 -0.43     -0.02   -0.12
## volatile.acidity                     0.09    0.03 -0.03     -0.04    0.07
## citric.acid                          0.12    0.15 -0.16      0.06   -0.08
## residual.sugar                       0.40    0.84 -0.19     -0.03   -0.45
## chlorides                            0.20    0.26 -0.09      0.02   -0.36
## free.sulfur.dioxide                  0.62    0.29  0.00      0.06   -0.25
## total.sulfur.dioxide                 1.00    0.53  0.00      0.13   -0.45
## density                              0.53    1.00 -0.09      0.07   -0.78
## pH                                   0.00   -0.09  1.00      0.16    0.12
## sulphates                            0.13    0.07  0.16      1.00   -0.02
## alcohol                             -0.45   -0.78  0.12     -0.02    1.00
## quality                             -0.17   -0.31  0.10      0.05    0.44
##                      quality
## fixed.acidity          -0.11
## volatile.acidity       -0.19
## citric.acid            -0.01
## residual.sugar         -0.10
## chlorides              -0.21
## free.sulfur.dioxide     0.01
## total.sulfur.dioxide   -0.17
## density                -0.31
## pH                      0.10
## sulphates               0.05
## alcohol                 0.44
## quality                 1.00

A study of the correlations between all the variables reveals that, for most pairs, the correlations are actually quite low. This is especially notable for the correlations between quality and the other variables. This is what was expected: I suspected earlier that quality is not likely to be correlated with any single variable.

When all the correlations deemed insignificant (alpha = 0.01) are crossed out, not many strong correlations survive. Nonetheless, the correlations provide some interesting insights. We will look at some of the higher correlations and their implications.

##                     row               column           cor            p
## 25       residual.sugar              density  0.8389666080 0.000000e+00
## 53              density              alcohol -0.7801375389 0.000000e+00
## 21  free.sulfur.dioxide total.sulfur.dioxide  0.6155009866 0.000000e+00
## 28 total.sulfur.dioxide              density  0.5298813581 0.000000e+00
## 49       residual.sugar              alcohol -0.4506312311 0.000000e+00
## 52 total.sulfur.dioxide              alcohol -0.4488920867 0.000000e+00
## 66              alcohol              quality  0.4355747104 0.000000e+00
## 29        fixed.acidity                   pH -0.4258582890 0.000000e+00
## 19       residual.sugar total.sulfur.dioxide  0.4014393091 0.000000e+00
## 50            chlorides              alcohol -0.3601887226 0.000000e+00
## 63              density              quality -0.3071234226 0.000000e+00
## 14       residual.sugar  free.sulfur.dioxide  0.2990983427 0.000000e+00
## 27  free.sulfur.dioxide              density  0.2942104042 0.000000e+00
## 2         fixed.acidity          citric.acid  0.2891806960 0.000000e+00
## 22        fixed.acidity              density  0.2653309703 0.000000e+00
## 26            chlorides              density  0.2572113872 0.000000e+00
## 51  free.sulfur.dioxide              alcohol -0.2501039505 0.000000e+00
## 60            chlorides              quality -0.2099344134 0.000000e+00
## 20            chlorides total.sulfur.dioxide  0.1989102960 0.000000e+00
## 57     volatile.acidity              quality -0.1947229654 0.000000e+00
## 32       residual.sugar                   pH -0.1941334456 0.000000e+00
## 62 total.sulfur.dioxide              quality -0.1747372150 0.000000e+00
## 31          citric.acid                   pH -0.1637482196 0.000000e+00
## 45                   pH            sulphates  0.1559514850 0.000000e+00
## 24          citric.acid              density  0.1495025158 0.000000e+00
## 3      volatile.acidity          citric.acid -0.1494718194 0.000000e+00
## 43 total.sulfur.dioxide            sulphates  0.1345623732 0.000000e+00
## 54                   pH              alcohol  0.1214321032 0.000000e+00
## 18          citric.acid total.sulfur.dioxide  0.1211307943 0.000000e+00
## 46        fixed.acidity              alcohol -0.1208811179 0.000000e+00
## 9           citric.acid            chlorides  0.1143644452 8.881784e-16
## 56        fixed.acidity              quality -0.1136628240 1.332268e-15
## 15            chlorides  free.sulfur.dioxide  0.1013923511 1.139311e-12
## 64                   pH              quality  0.0994272381 3.080647e-12
## 59       residual.sugar              quality -0.0975768268 7.724044e-12
## 12     volatile.acidity  free.sulfur.dioxide -0.0970119387 1.019163e-11
## 6           citric.acid       residual.sugar  0.0942116231 3.935585e-11
## 13          citric.acid  free.sulfur.dioxide  0.0940772220 4.195155e-11
## 36              density                   pH -0.0935915634 5.280354e-11
## 16        fixed.acidity total.sulfur.dioxide  0.0910697579 1.711435e-10
## 33            chlorides                   pH -0.0904394686 2.284974e-10
## 17     volatile.acidity total.sulfur.dioxide  0.0892605036 3.902969e-10
## 4         fixed.acidity       residual.sugar  0.0890206993 4.348371e-10
## 10       residual.sugar            chlorides  0.0886845365 5.057195e-10
## 48          citric.acid              alcohol -0.0757287294 1.119361e-07
## 44              density            sulphates  0.0744931176 1.795269e-07
## 8      volatile.acidity            chlorides  0.0705115721 7.824606e-07
## 47     volatile.acidity              alcohol  0.0677179396 2.100319e-06
## 5      volatile.acidity       residual.sugar  0.0642860606 6.712237e-06
## 39          citric.acid            sulphates  0.0623309389 1.268864e-05
## 42  free.sulfur.dioxide            sulphates  0.0592172444 3.369446e-05
## 65            sulphates              quality  0.0536778755 1.709793e-04
## 11        fixed.acidity  free.sulfur.dioxide -0.0493958555 5.437313e-04
## 38     volatile.acidity            sulphates -0.0357281491 1.239761e-02
## 30     volatile.acidity                   pH -0.0319153704 2.550817e-02
## 23     volatile.acidity              density  0.0271139368 5.776822e-02
## 40       residual.sugar            sulphates -0.0266643669 6.204414e-02
## 7         fixed.acidity            chlorides  0.0230856426 1.062094e-01
## 1         fixed.acidity     volatile.acidity -0.0226972941 1.122218e-01
## 55            sulphates              alcohol -0.0174327735 2.225307e-01
## 37        fixed.acidity            sulphates -0.0171429832 2.303158e-01
## 41            chlorides            sulphates  0.0167628806 2.408178e-01
## 58          citric.acid              quality -0.0092090871 5.193461e-01
## 61  free.sulfur.dioxide              quality  0.0081580672 5.681271e-01
## 35 total.sulfur.dioxide                   pH  0.0023209811 8.709954e-01
## 34  free.sulfur.dioxide                   pH -0.0006177853 9.655221e-01
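
A table of this shape (one row per variable pair, with the correlation and its p-value, sorted by strength) can be produced in base R along the following lines. This is a hedged sketch rather than the code behind the table above: the helper name `pairwise_cor` is hypothetical, and `demo` is simulated stand-in data.

```r
# Hypothetical helper: one row per variable pair with Pearson r and p-value.
pairwise_cor <- function(df) {
  pairs <- combn(names(df), 2, simplify = FALSE)
  rows <- lapply(pairs, function(p) {
    ct <- cor.test(df[[p[1]]], df[[p[2]]])   # Pearson correlation by default
    data.frame(row = p[1], column = p[2],
               cor = unname(ct$estimate), p = ct$p.value,
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, rows)
  out[order(-abs(out$cor)), ]                # strongest correlations first
}

set.seed(42)  # simulated stand-in data, not the wine data
demo <- data.frame(a = rnorm(200))
demo$b <- 2 * demo$a + rnorm(200, sd = 0.5)  # constructed to track a closely
demo$c <- rnorm(200)                         # independent noise
res <- pairwise_cor(demo)
subset(res, p < 0.01)                        # pairs significant at alpha = 0.01
```

With nearly 4900 observations even very small correlations reach significance, which is why the size of the coefficient matters more than the p-value when reading the table.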

Among the top 5 correlations, three involve density. The correlation between density and residual sugar was 0.84, ranking first on the list, while the correlation between density and alcohol was -0.78 and the correlation between total SO2 and density was 0.53, ranking second and fourth respectively. Considering that the third largest correlation was between free SO2 and total SO2, two variables that are obviously correlated, it can be concluded that density has notably high correlations with other variables.

These correlations are quite strong. Since sugar and SO2 are denser than water, density should increase if a wine contains more sugar or SO2. In contrast, density should decrease if a wine has a higher alcohol level, since alcohol is less dense than water. Therefore, these correlations make sense.

Alcohol is produced by fermentation. Since the amount of sugar and yeast activity determine fermentation, alcohol should be correlated with them. Residual sugar will be negatively correlated with the amount of sugar consumed during fermentation. Furthermore, as SO2 is an anti-microbial agent, the amount of SO2 will be negatively correlated with the degree of fermentation that has taken place. Therefore, the negative correlations between alcohol and both variables make sense.

Two variables most closely related to quality are density and alcohol. However, scatterplots do not reveal the relationship clearly. Therefore, boxplots were drawn using levels instead of quality measures.

Now it is visible that higher quality wines tend to have more alcohol and lower density. However, as seen before, a wine with more alcohol is going to be less dense. Moreover, these two variables are themselves connected with several other variables, as explored previously. Therefore, it can be concluded that no single variable determines the quality of wine.

Excellent wines have lower median values for residual sugar and chlorides, but the median values for citric acid do not differ greatly across the different levels of wine. However, one thing that is distinguishably different for excellent wines is that they show smaller variation in all three variables. Perhaps it is a moderate amount of everything that makes an excellent wine.
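
The comparison of centre and spread across levels can also be made numerically rather than read off the boxplots. The sketch below computes the median and interquartile range per level in base R; the column names follow the data set, but the values in `wine` are made up for illustration.

```r
# Stand-in data with made-up values (not the wine data).
wine <- data.frame(
  Level = factor(rep(c("Low", "Average", "Excellent"), each = 3),
                 levels = c("Low", "Average", "Excellent")),
  residual.sugar = c(1, 8, 20,   2, 6, 14,   4, 5, 6)
)

# Median (centre) and interquartile range (spread) for each level.
stats <- sapply(split(wine$residual.sugar, wine$Level),
                function(v) c(median = median(v), IQR = IQR(v)))
stats
```

The result is a small matrix with one column per level, so a narrower spread for one level shows up directly as a smaller value in the `IQR` row.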

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Correlations between the various variables and quality were not very strong. The strongest correlations with quality were those of density and alcohol, and even those were not strong. However, when graphs were drawn using the categorical variable, level, the relationship became visible: higher quality wines tend to have more alcohol and lower density.

Further analysis revealed that the median values for residual sugar, citric acid, and chlorides do not differ greatly across the different levels of wine. However, excellent wines show much smaller variation in all three variables, as shown by the box plots. This led me to a new suspicion: what makes a wine great is the blending of several tastes; in other words, a moderate amount of every feature makes an excellent wine. This will be investigated further in the Multivariate Analysis.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Among the top 4 correlations, three involve density. Considering that the third largest correlation was between free SO2 and total SO2, two variables that are obviously correlated, it can be concluded that density has notably high correlations with other variables. Of course, density changes as substances with different densities are added to the liquid (sugar being denser than water and alcohol less dense), so these correlations make sense.

Also, alcohol was noticeably correlated with residual sugar and the total amount of SO2. Since the amount of sugar and yeast activity determine fermentation, alcohol, a product of fermentation, should be correlated with them. Residual sugar will be negatively correlated with the amount of sugar consumed during fermentation. Furthermore, as SO2 is an anti-microbial agent, the amount of SO2 will be negatively correlated with the degree of fermentation that has taken place. Therefore, the negative correlations between alcohol and both variables make sense.

What was the strongest relationship you found?

Correlation between density and residual sugar was 0.84 ranking number one on the list. Since sugar is more dense than water, density should increase if wine contains more sugar. Thus, the relationship did not deviate from what was expected.

Multivariate Plots Section

As visible from the graphs, green dots are located around the center of the distributions when residual sugar is plotted against citric acid or chlorides. When one of the variables, whether it be residual sugar, chlorides, or citric acid, is too high, the dots are blue or red. The high concentration of green dots around the center visually shows that a nice blending of all the flavors makes a wine excellent.

The distributions for free SO2, total SO2, volatile acidity, and fixed acidity all show similar patterns. Green dots are located around the center of the distributions, both horizontally and vertically. However, the central tendency is not as strong for these four variables as it is in the plots of residual sugar, citric acid, and chlorides.

When plotting density against alcohol, it was visible that excellent wines are located in the lower right corner. This means that excellent wines have a higher alcohol concentration and lower density. However, I do not think that a higher dose of alcohol in itself makes a wine excellent. Since most variables are negatively correlated with alcohol, having a large amount of a certain compound would mean a decrease in alcohol concentration. Because excellent wines have a moderate amount of everything, they are more likely to have a higher percentage of alcohol. And as alcohol and density are negatively correlated, excellent wines are bound to have lower density values.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The plots were consistent with my new hypothesis from the Bivariate Analysis. Looking at plots with residual sugar as the x variable and citric acid or chlorides as the y variable, with each dot colored by level, I could see that excellent wines were located near the center of the distributions. This suggests that excellent wines are wines that have a moderate amount of all chemical compounds: not too much and not too little, just the right amount. In fact, the trend was consistent for other chemical characteristics such as free SO2, total SO2, fixed acidity, and volatile acidity.

Were there any interesting or surprising interactions between features?

Plotting alcohol against density, with points colored by level, revealed that excellent wines tend to have more alcohol and lower density. Since alcohol is negatively correlated with most variables, this makes sense.


Final Plots and Summary

Plot One

Description One

The distribution for residual sugar was strongly skewed to the right. However, when a log transform was applied, the distribution turned out to be bimodal.

Plot Two

Description Two

The median values for citric acid did not differ greatly across the different levels of wine. However, one thing that is distinguishably different for excellent wines is that they show smaller variation in all three variables (residual sugar, citric acid, and chlorides). Perhaps it is a moderate amount of everything that makes an excellent wine.

Plot Three

Description Three

As visible from the graphs, green dots are located around the center of the distributions when residual sugar and citric acid are plotted against one another. When one of the variables, whether it be residual sugar, chlorides, or citric acid, is too high, the dots are blue or red. The high concentration of green dots around the center visually shows that a nice blending of all the flavors makes a wine excellent.


Reflection

Frankly, I do not drink wine very often and personally cannot distinguish a “good” wine from a “bad” one. I chose this data set because I was curious how to differentiate a good wine from a bad one. Some wines are priced at thousands of dollars! What makes such wines so special? Could I make a model that predicts the price of a wine based on its chemical characteristics? What is the most important factor that determines the quality of wine? Questions like those were the beginning of this research.

However, when I calculated correlation values for the data in the exploratory phase, I ran into trouble. As shown above, the correlations were very low for most variables. As a rule of thumb, a correlation whose absolute value is less than 0.3 is often treated as “no relationship.” By that standard, only two variables could be considered to have a relationship with quality. Moreover, those two variables are alcohol and density, which themselves are dependent variables!
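
The 0.3 rule of thumb can be checked directly against the correlations with quality reported in the Bivariate Plots Section. The sketch below copies those rounded values into a named vector; the variable names `quality_cor` and `strong` are only illustrative.

```r
# Correlations with quality, copied (rounded) from the correlation matrix
# in the Bivariate Plots Section.
quality_cor <- c(fixed.acidity = -0.11, volatile.acidity = -0.19,
                 citric.acid = -0.01, residual.sugar = -0.10,
                 chlorides = -0.21, free.sulfur.dioxide = 0.01,
                 total.sulfur.dioxide = -0.17, density = -0.31,
                 pH = 0.10, sulphates = 0.05, alcohol = 0.44)

# Keep variables whose absolute correlation reaches the 0.3 rule of thumb.
strong <- quality_cor[abs(quality_cor) >= 0.3]
names(strong)
```

Only density and alcohol pass the threshold, which matches the observation in the paragraph above.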

At first, I suspected that the quality of a wine is not determined by its chemical composition. It may be the brand, the appearance of the bottle, or the sommelier’s mood at the time that determines the wine’s quality. Then I realized that the quality variable is ordinal; a wine with quality 9 is not three times better than a wine with quality 3. I tried to build a model that predicts the quality value, as we did for price in the diamond data set, and only then realized that the price variable and the quality variable are innately different. So I constructed a categorical variable, “levels”, and looked for trends. Setting up the level variable was the key decision that made this research possible.

Several patterns emerged after that. Contrary to what I expected, the median values of many variables were not that different for wines of different levels. However, the ranges of the variables were visibly smaller for excellent wines. Indeed, fervid wine drinkers often look for completeness when they drink wine. In other words, they want wines with a variety of tastes blending together. My bivariate and multivariate analyses support the idea that a good wine is a wine that has a moderate amount of everything.

Lastly, it would be desirable to include a price variable in future research. Exploring the relationship between the price and quality of wine, and between the price and chemical composition of wine, would be very interesting. Furthermore, I want to conduct a comparative study with the red wine data. I suspect that sommeliers look for a different blending of tastes in red wine than in white wine, and I want to see if that is true.